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Discrete Wasserstein Barycenters: 
Optimal Transport for Discrete Data 


Ethan Anderes • Steffen Borgwardt • Jacob Miller 


Abstract Wasserstein barycenters correspond to optimal solutions of transportation problems for 
several marginals, and as such have a wide range of applications ranging from economics to statistics 
and computer science. When the marginal probability measures are absolutely continuous (or vanish 
on small sets) the theory of Wasserstein barycenters is well-developed (see the seminal paper |l]). 
However, exact continuous computation of Wasserstein barycenters in this setting is tractable in 
only a small number of specialized cases. Moreover, in many applications data is given as a set of 
probability measures with finite support. In this paper, we develop theoretical results for Wasserstein 
barycenters in this discrete setting. Our results rely heavily on polyhedral theory which is possible 
due to the discrete structure of the marginals. 

Our results closely mirror those in the continuous case with a few exceptions. In this discrete 
setting we establish that Wasserstein barycenters must also be discrete measures and there is always 
a barycenter which is provably sparse. Moreover, for each Wasserstein barycenter there exists a non¬ 
mass-splitting optimal transport to each of the discrete marginals. Such non-mass-splitting transports 
do not generally exist between two discrete measures unless special mass balance conditions hold. 
This makes Wasserstein barycenters in this discrete setting special in this regard. 

We illustrate the results of our discrete barycenter theory with a proof-of-concept computation 
for a hypothetical transportation problem with multiple marginals: distributing a fixed set of goods 
when the demand can take on different distributional shapes characterized by the discrete marginal 
distributions. A Wasserstein barycenter, in this case, represents an optimal distribution of inventory 
facilities which minimize the squared distance/transportation cost totaled over all demands. 

Keywords barycenter • optimal transport • multiple marginals • polyhedral theory • mathematical 
programming 
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1 Introduction 


Optimal transportation problems with multiple marginals are becoming important in applications 
ranging from economics and finance [2li71[9lfl2l to condensed matter physics and image processing 
[6l ll0lll31l221l24j . The so-called Wasserstein barycenter corresponds to optimal solutions for these 
problems, and as such has seen a flurry of recent activity (see [Dlll3!i[IIl[MIIIllIl|2ni|l9l|2T]|25| ). 
Given probability measures Pi,, P^ on R d , a Wasserstein barycenter is any probability measure 
P on R d which satisfies 

N N 


J2w 2 (P,p l f = 


= inf 


pg-p2(Rd) 




(i) 


where W 2 denotes the quadratic Wasserstein distance and P 2 (R <i ) denotes the set of all probability 
measures on R rf with finite second moments. See the excellent monographs mm for a review of the 
Wasserstein metric and optimal transportation problems. 
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Fig. 1: The above four images represent hypothetical monthly demands (as a percentage of total supply) for 
distributing a fixed set of goods to nine California cities (denoted by red ‘x’ marks) in four different months (February, 
March, June and July). Percent demand within each month is plotted proportional to disk area and is computed from 
monthly average temperature and population within each city (see Section [4] for details). When percent demand 
is treated as a discrete probability distribution, one for each month, the Wasserstein barycenter represents the 
optimal distribution of inventory facilities which minimize total squared distance/transportation cost over multiple 
monthly demand requirements. This example serves to illustrate the applicability of the main theoretical properties 
derived in this paper. Theorem [2] for example, establishes that the optimal inventory distribution is a sparse 
discrete probability distribution with tight bounds on the scarcity of the barycenter support. In particular, the 
optimal inventory facilities are located at a small number of sites with relatively large storage capacity, rather than 
a large number small-capacity facilities distributed over a diffuse set of locations. Theorem^shows that the optimal 
transportation plan assigns each to barycenter inventory facility exactly one city to supply each month. Indeed, this 
type of non-mass-splitting property of optimal mass transportation is known for absolutely continuous probability 
distributions but does not usually hold for discrete probability distributions. The discrete Wasserstein barycenter is 
unique in this regard: there always exists a non-mass-splitting optimal transportation plan to each of the individual 
probability distributions (represented by monthly demand in this example). The Wasserstein barycenter for this 
example is shown in Figure [2] and some of the optimal transportation plans are shown in Figure [3] Finally, the 
computational details of this example are presented in Section [4l 


Much of the recent activity surrounding Wasserstein barycenters stems, in part, from the seminal 
paper pQ. In that paper, Agueh and Carlier establish existence, uniqueness and an optimal transport 
characterization of P when P\ ,p N have sufficient regularity (those which vanish on small sets 
or which have a density with respect to Lebesgue measure). The transportation characterization 
of P, in particular, provides a theoretical connection with the solution of Q and the estimation 
of deformable templates used in medical imaging and computer vision (see |13l m and references 
therein). Heuristically, any measure P is said to be a deformable template if there exists a set of 
deformations ipi,... ,ipn which push-forward P to Pi , P N , respectively, and are all “as close as 
possible” to the identity map. Using a quadratic norm on the distance of each map tpi(x),... ,<pn(x) 
to x, a deformable template P then satisfies 


P e arg inf 

PGV 2 (K d ) 


N r 

inf M*) 

UVl, ■ ■ ■ > <Pn) 

. S.t. <Pi(P) = Pi} 


x\ 2 dP(x) 


( 2 ) 


The results of Agueh and Carlier establish that 0 and 0 share the same solution set when 
Pi ,..., P N have densities with respect to Lebesgue measures (for example). 

While absolutely continuous barycenters are mathematically interesting, in practice, data is often 
given as a set of discrete probability measures Pi ,..., P N , i.e. those with finite support in R d . For 
example, in Figure [I] the discrete measures denote different demand distributions over 9 California 
cities for different months (this example is analyzed in detail in Section [4| . For the remainder of the 
paper we refer to a discrete Wasserstein barycenter as any probability measure P which satisfies ([!]) 
and where all the Pi ,P N have discrete support. 

In this paper we develop theoretical results for discrete Wasserstein barycenters. Our results 
closely mirror those in the continuous case with a few exceptions. In the discrete case, the uniqueness 
and absolute continuity of the barycenter is lost. More importantly, however, is the fact that P is 
provably discrete when the marginals are discrete (see Proposition [l]). This guarantees that finite- 
dimensional linear programming will yield all possible optimal P, and this in turn is utilized in this 














Discrete Wasserstein Barycenters: Optimal Transport for Discrete Data 


3 



Fig. 2: The leftmost image shows a Wasserstein barycenter computed from 8 discrete probability distributions, each 
representing a different monthly demand (4 of the months are shown in Figure[lj. Notice that barycenter support is 
extremely sparse—supported on 63 discrete locations—as compared to the 12870 possible barycenter support points 
(shown in the rightmost image) guaranteed by Proposition [l] Notice that Theorem [5] gives an upper bound of 65 
support points for the optimal Wasserstein barycenter shown here. The role of Proposition [T] on the other hand, is 
to give a finite set inclusion bound on the possible barycenter support points (shown at right in this example). This 
result yields the finite dimensional linear program characterization of optimal Wasserstein barycenters which is key 
to the analysis presented in this paper. 


paper to study the properties of these barycenters from the point of view of polyhedral theory. In doing 
so, we find remarkable differences and similarities between continuous and discrete barycenters. In 
particular, unlike the continuous case, there is always a discrete barycenter with provably sparse finite 
support; however, analogously to the continuous case, there still exists non-mass-splitting optimal 
transports from the discrete barycenter to each discrete marginal. Such non-mass-splitting transports 
generally do not exist between two discrete measures unless special mass balance conditions hold. 
This makes discrete barycenters special in this regard. 

In Section [2j we introduce the necessary formal notation and state our main results. The corre¬ 
sponding proofs are found in Section 3. To illustrate our theoretical results we provide a computational 
example, dicussed in Section[4]and Figures[l][3j for a hypothetical transportation problem with multi¬ 
ple marginals: distributing a fixed set of goods when the demand can take on different distributional 
shapes characterized by Pi ,..., P N . A Wasserstein barycenter, in this case, represents an optimal 
distribution of inventory facilities which minimize the squared distance/transportation cost totaled 
over all demands Pi ,..., P N . 


2 Results 


For the remainder of this paper Pi,..., P N will denote discrete probability measures on R d with finite 
second moments. Let P 2 (R d ) denote the space of all probability measures with finite second moments 
on R d . Recall, a Wasserstein barycenter P is an optimizer to the problem 


inf 

Per 2 (R d ) 


N 

J2w 2 (p,p t f. 

i=1 


( 3 ) 


The first important observation is that all optimizers of |3j) must be supported in the finite set S C R d 
where 

S = I '' 1 + ■ n + Xn | Xi £ supp(Pi)} (4) 

is the set of all possible centroids coming from a combination of support points, one from each measure 
Pi- In particular, letting P^(R d ) = {P € P 2 (R d )| supp(P) C S} the infinite dimensional problem ^ 
can be solved by replacing the requirement P e P 2 (R d ) with P € T§ (R d ) to yield a finite dimensional 
minimization problem. This result follows from Proposition [l] below. 
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Fig. 3: These two plots illustrate the special property of discrete Wasserstein barycenters proved in Theorem |]J 
there is no mass-splitting when optimally transporting the inventory at each barycenter support to the corresponding 
demand for each month. The image at left shows all the transported mass flowing from the optimal barycenter into 
San Francisco, Sacramento, Los Angeles and San Bernardino for month of March (the corresponding March demand 
is shown middle-left in Figure^). The image at right shows the corresponding optimal transport for the month of 
July. Notice that these figures only show the barycenter support points which transport into the four cities shown 
here. The other barycenter supports transport goods to the other five cities not shown. We remark that Theorem 
[T] also establishes that transportation is balanced so that the transportation displacements sum to zero at each 
barycenter support point. 


Proposition 1 Suppose Pi,...,P/v are discrete probability measures on M . d . Let II(Pi , Pn) denote 
the set of all coupled random vectors (X%,.. ., Xm) with marginals X j ~ Pj and let X denote the coordinate 
average A ' 1+ 'jv~ l ~' Yw ■ Let S be defined as in m|). 

i) There exists (X °,..., X'fij) £ II (Pi ,..., Pm) such that 

E\X°\ 2 = sup F|x| 2 . (5) 

(Xl-.-.A-jv) 

e n(p u ...,p N ) 


ii) Any (X °,..., Xfj) £ II (Pi ,..., Pm) which satisfies has supp(CX°) C S and 


N 


N 


N 


J2W2{£X°,P z ) 2 = 


= inf 


PeP 2 (R d ) 


J2w 2 (p,p z ) 2 = 


= inf 


P£Pg(WL d ) 


Y, W 2{P,Pif 


( 6 ) 


where CX° denotes the distribution (or law) of X°. 

N 

in) Any P £ arg min V W 2 (P, PA 2 satisfies supp(P) C S. 

PeP 2 (R d ) i = 1 

Notice that the existence of (X °,..., X ^f), in partof the above proposition, follows immediately 
from the general results found in Kellerer m and Rachev [23]. Parts and[iii|l are proved in Section 
[3] We also remark that during the preparation of this manuscript the authors became aware that 
Proposition [l] was independently noted in [8j, with a sketch of a proof. For completeness we will 
include a detailed proof of this statement which will also provide additional groundwork for Theorem 
[T]and Theorem [2] below. 

Proposition [l] guarantees that any barycenter P computed with discrete marginals has the form 

P = ^ 2x<5x, 2x £ R>0- (7) 

xG5 


Here 5 X is the Dirac-<5-function at x £ R d and z x corresponds to the mass (or probability) at x. 
This implies that any coupling of P with Pi, which realizes the Wasserstein distance, is in fact 
characterized by a finite matrix. Treating the coordinates of these matrices and the values z x as 
variables, the set of all solutions to 0 are obtained through a finite-dimensional linear program (see 
(231 below). In [8] a similar linear program was used to find approximate barycenters for sets of 
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absolutely continuous measures by finitely approximating the support of P (which is sub-optimal for 
the continuous problem). Our use of the finite linear program characterization of P is different from 
continuous approximation. We use a version of the linear program to analyze properties of discrete 
barycenters themselves. Indeed, since the set of all discrete barycenters is on a face of the underlying 
polyhedron, one can study their properties by means of polyhedral theory. 

Our first theorem illustrates a similarity between barycenters defined from absolutely continuous 
Pi,, P/v and barycenters defined in the discrete setting. The results of p] establish, in the absolutely 
continuous case, that there exist optimal transports from the barycenter to each P, which are optimal 
in the sense of Wasserstein distance and are gradients of convex functions. Theorem [l] shows that 
such transports not only exist for discrete barycenters but also share similar properties. 

Theorem 1 Suppose Pi...., P]y are discrete probability measures. Let P denote a Wasserstein barycenter 
solution to uV and let X be a random variable with distribution P. Then there exist finite convex functions 
ifi : R^ —> R“, for each i = 1,..., N, such that 

i) V^i(P) = Pi, Vt. 

ii) E\X - ViffiX )| 2 = W 2 (P,P l ) 2 , Vi 

1 N 

in) ^2 Vifi(xj) = Xj, Vxj € supp(P). 

2=1 

1 N \x -\ 2 

iv ) ]vX>(*i) = “ 2 “’ yx i £ supp{P). 

i =1 

Intuitively, one would expect the support of a barycenter to be large to accommodate such a con¬ 
dition. This is particularly plausible since such these transports must realize the Wasserstein distance 
between each measure and the barycenter. However, it has been noted that the barycenters of dis¬ 
crete measures are often sparse in practice; see for example nu. Our second main result resolves this 
tension and establishes that there always is a Wasserstein barycenter whose solution is theoretically 
guaranteed to be sparse. 


Theorem 2 Suppose Pi,..., Pjy are discrete probability measures, and let Si = \supp(Pi)\. Then there 
exists a barycenter P of these measures such that 

N 

\supp(P)\<J2 S i~ N + 1 - (8) 

i=1 

We would like to stress how low this guaranteed upper bound on the size of the support of the 
barycenter actually is. For example, let every P,; have a support of the same cardinality T. Then 
| S\ < T n and if the support points are in general position one has |S| = T N . In contrast, the support 
of the barycenter has cardinality at most NT. 

Additionally, the bound in Theorem [2] is the best possible in the sense that, for any natural 
numbers N and W , it is easy to come up with a set of N measures for which |supp(P)| = S', — 

N + 1 = W: Choose Pi to have W support points and uniformly distributed mass ^ on each of 
these points. Choose the other P, to have a single support point of mass 1. Then |5| = W and the 
barycenter uses all of these possible support points with mass ^. 

A particularly frequent setting in applications is that all the Pi are supported on the same discrete 
grid, uniform in all directions, in R d . See for example mm for applications in computer vision with 
d = 2. In this situation, the set S of possible centroids is a finer uniform grid in R rf , which allows us 
to strengthen the results in Proposition [l] and Theorem [2] 

Corollary 1 Let P\ , ■ ■ ■ , P/V be discrete probability measures supported on an L\X ... xL^-grid, uniform 
in all directions, in R d . Then there exists a barycenter P supported on a refined (N(L\ — 1) + 1) x ... x 

_ d 

(N(L t i — 1) + 1 )-grid, uniform in all directions, with \supp(P)\ < N( Li — 1) + 1. In particular, the 

2=1 

density of the support of the barycenter on this finer grid is less than 


n 


N^ 1 A = 1 (Li - 1 )' 


3 Proofs 


In this section we prove the results outlined in Section [2] We begin with a proof of Proposition [l] 
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3.1 Existence of Discrete Barycenters 

Recall that a discrete barycenter P is an optimizer of ([3| when Pi,..., Py are discrete probability 
measures. We will show that P must have the form of a coordinatewise average of optimally coupled 
random vectors with marginals given by the P,. In particular, we will establish the existence of N 
random vectors X°,... , Ay with marginal distributions X° ~ P, that are as highly correlated as 
possible so that the variability in the average X° = =— jX— K is maximized. Once these coupled 
random vectors A°,..., Ay are obtained, the distribution of the average X° (denoted CX°) will serve 
as P. 


Proof (of Proposition [7]) As remarked earlier, part [7p of Proposition [l] follows from the general re¬ 
sults of Kellerer m and Rachev [23\ Therefore there exists an optimally coupled random vector 
(A°,..., Ay) £ II (Pi, ..., Py) which satisfies (|5|). We will show that 


N 

^W 2 (£X^,p) 2 

i= 1 


N 


inf 

Pev 2 (M 


y w 2 (p,p ) 2 


(9) 


Notice the definition of S automatically implies supp(£X°) C S so that (|9j) will imply 

N N N 

yw 2 (£x^,p) 2 = inf yw 2 (P,P l ) 2 = inf Vw 2 (P,P) 2 
'Ps( Rd ')~l p G p 2 (R d )“( 


( 10 ) 


and complete the proof of part [^). 

So suppose P £ P 2 (R d ). Then for all * = 1 there exists an optimally coupled random 

vector (Y* , X* ) £ 77(P,Pj) such that W 2 (P, Pi) 2 = E\Y* — A'*| 2 . (This is a well known property of 
the Wasserstein distance W 2 , see for example Proposition 2.1 in [26].) Since the random variables 
Y *,..., Y* all have distribution P it is easy to see that there exists a generalized Gluing lemma for 
the existence of a random vector (Y, Ai,..., Ay) £ 27 (P, Pi,..., Py) such that (Y, A^) has the same 
distribution as (Y*, A*) for each i. This can be seeing by first sampling a single Y ~ P then sample 
Ai,..., Ayr independently conditional on Y where the conditional distribution Pr(Aj = x\Y = y ) is 
set to Pr(A* = x\Y* = y ) (the finite support of Pi,...,Py is sufficient to guarantee existence of 
these conditional distributions). This yields 


N 


N 


N 


± y w 2 ( P , p ) 2 = ± y e\y* - a* i 2 = i y E \y - Xi 

i =1 i =1 i=l 

Now note that X{ ~ F\ and X° ~ PThus 


( 11 ) 


N 


N 


N 


N 


N 


\X°-X°\ =^2e\X°\ -2E^2(X°,X°)+^2e\X°\ = -NE\X°\ +^2e\X°\ 


N 


2=1 

N 


= -NE\X°\ + S^ElxA = inf -NElxl + 

(x 1 ,...,x N ) 11 trf 1 1 


inf 

(x u ...,x N ) 
e n(p u ...,p N ) ‘ 


e n(p 1 ,...,p N ) 

N N N 

yp|A| 2 -2py(A,A l )+yp|A l | 


N 


inf 

(AA,...,-Y a , 

£ n(Pi ,. . . , -Prv) 


yp|A-A,|. 


( 12 ) 


Also, note that 


P|A°-Af| 2 > inf_ E\Y - X\ 2 = W 2 {CX°,P l ) 2 . 

( Y,x)en(CX°,Pi) 


Combining (12 I and (131, we get 


(13) 


N N N 

l y pia - a,i 2 > l y p|a° - a° i 2 > l y w 2 (cx*, pf. 


(14) 
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Further we have a minorant for the right hand side of (11) as follows 
N 


1 £ E\Y - X z \ 2 = i £ E\Y -X + X- Xi\ 2 

i= 1 i=l 

1 N _ o N _ 1 JV _ 

= mY, E \ Y ~ X \ 2 + N E T,( Y X ’ X - A '*> + v E E \ X X ' 


2=1 

N 


= E\Y - X \ 2 + ^E(Y - X, £(X - X,)} + -t £ £|X - Xj 

2=1 2=1 
iV _ i ^ _ 

= s|y - x| 2 + - £p|x -x,f > - £p|x - x,| 2 . 

2=1 2=1 


Putting (11), (14), and (15) together we obtain 

N „ AT 


2=1 


2=1 


(15) 


/V /V i N _ N _ 

^£w 2 (P,p l ) 2 = ^£s|y-x,| 2 > -£x|x-x l | 2 > - £ir 2 (£x°,p ? ;) 2 - (ie) 


This shows that CX° is a minimizer of our problem and hence a barycenter, proving part [^j. 

Finally, to prove part note that if P £ P 2 (R d ) and supp(P) (7 S, then any coupling 
(Y,X i,..., Xjv) € n(P, Pi, ..., P iV ) must satisfy E\Y — X\ 2 > 0 (since supp(X) C S and supp(P) ^ S). 
This implies, by the last line of (15 I, that 


! " i N ___ 

£p|y-Xi| 2 > ^£p|x-x !; | 




(17) 


and hence that 

/V JV 1 N _ N _ 

N £ w 2 (p,p t ) 2 = - £ P|y - x *| 2 > - £p|x - x ,| 2 > - £ w 2 {cxo, Pi f, (18) 

1=1 1=1 2 = 1 1=1 

so that P is not a barycenter. Therefore for any barycenter P, we must have supp(P) C S, which 
proves part [iii]). □ 


3.2 Linear Programming and Optimal Transport 

Let us now develop a linear programming model (LP) for the exact computation of a discrete barycen¬ 
ter. Suppose we have a set of discrete measures P l: i = 1,...,N, and additionally another discrete 
measure P. Let So = |supp(P)| and S, = |supp(Pj)| for each i as before. Let Xj, j = 1,..., So be the 
points in the support of P, each with mass dj, and let xn. -, k = 1 Si be the points in the support 
of Pi, each with mass di k - For the sake of a simple notation in the following, when summing over 
these values, the indices take the full range unless stated otherwise. 

If (X, Yj) £ n{P,Pi), then this coupling can be viewed as a finite matrix, since both probability 
measures are discrete. We define y,jk > 0 to be the value of the entry corresponding to the margins 
Xj and Xik in this finite matrix. 

Note in this coupling that 7^,. y^ k = dj for all j and that ijijk = d ik for all k and further that 
E\X I |~ ^ \Xj Xik | • yijk ^ ) Cjjk ' Vijki (19) 

j,k j,k 


where c^jf. := |Xj — x ^| 2 just by definition. 

Given a non-negative vector y = (yijk) > 0 that satisfies ^kVijk = dj for all i and j and 
yT yijk = d^ f° r a ll * and k, we call y an N-star transport between P and the P^. We define the cost 
of this transport to be c(y) := V ( _ A . r, ,,.. • y ijk . 

Further there exist vectors (X*,Y*) £ 77(P,Pj) for all i, and a corresponding N-star transport 
y*, such that 


£ W 2 (P, Pi) 2 = £ P|X* - y*| 2 =» c (y*). 


(20) 
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For any (X, Yj) £ 17 (P, Pf) we also have E\X * — Y*\ 2 < E\X — Yi | 2 , and hence it is easily seen that 
y* is an optimizer to the following linear program 


min c(y) 
y 

^ ' Vijk dj , Vi 1,..., N, V j 1,..., So, 

k 

^ ' Vijk dik, Vi 1 5 - - - ? A" 5 Vfc 1, - - -, 

i 

Vijk > 0, Vi = 1,..., IV, Vjr = 1,..., S 0 , Vfc = 1,..., Sj. 


( 21 ) 


Now suppose we wish to find a barycenter using a linear program. Then using Proposition [T] we 
know that this amounts to finding a solution to 


N 


PEVg(R d ) ~y l 


E^mpp ) 2 


P = ^2 -Sx^x, 2 x G K>0- 

xG5 


( 22 ) 


Using this we can expand the possible support of P in the previous LP to S, and let the mass at 
each Xj £ S be represented by a variable Zj > 0. This is a probability distribution if and only if 
the constraint JT Z j = 1 is satisfied. Then every exact barycenter, up to measure-zero sets, must be 
represented by some assignment of these variables and hence is an optimizer of the LP 


min 

c(y) 




y,z 

E yij k — 

z v 

Vi = 1,.. 

■,N, Vj = 1,.. 

;S 0 , 

k 

E Vijk ~ 

i 

j 

Vi = 1,. 

,.,N, Vfc = 1,. 


Al 

-se 

■5 

0, 

Vi = 1,. 

..,N, Wj = 1, 

■ ■ ■ ,So, 

IV 

0, 

V. j = 1, ■ 

■ ■,S 0 . 



(23) 


Since each P 4 is a probability distribution it is easy to see that JT z-j = 1 is just a consequence of 
satisfaction of the other constraints. Any optimizer (y*,z*) to this LP is a barycenter P in that 


N 


N 


Pevg( R d ) 


W 2 (P, P) 2 = ]T W MP P) 2 = c(y* 


and 


^ = E ; 


(24) 


It is notable that the LP in (23) corresponds to N transportation problems, linked together with 
variables Zj , representing a common marginal for each transportation problem. In fact it is not hard 
to show that in the case N = 2 this LP can be replaced with a network flow LP on a directed 
graph. It is easily seen that this LP is both bounded (it is a minimization of a positive linear sum of 
non-negative variables) and feasible (assign an arbitary Zj = 1 and the remainder of them 0 and this 
reduces to solving N transportation LPs). Thus it becomes useful to write down the dual LP, which 
also bares similarity to a dual transportation problem 


max 

t,o 

E dik ■ Tik 




i,k 




@ij H - — 

^ijki ^ 1? • 

■■,N, Vj = 1,. 

■ ■ ,So, Vfc = l,...,p, 


M 

<2; 

IV 

0, Vz = 1,. 

..,N, Vj = 1,. 

■■,So, 

(25) 


j 


where there is a variable for each defining measure i and each Xik £ supp(Pj) and a variable 9ij 
for each defining measure i and each Xj £ S. 

These LPs not only will be used for computations in Section [4j but also can be used to develop 
the necessary theory for Theorem |T| 


Lemma 1 Let Pi ,P N be discrete probability measures with a barycenter P given by a solution (y*, z*) 
to 1 23). Then 

i) For any Xj £ supp(P) (i.e. Zj > Oj combined with any choice of a:,/;. £ supp(Pi) for i = 1 ,... ,N 
such that y*jk > 0 for each i, one then has Xj = JT Xjj-i. 





Discrete Wasserstein Barycenters: Optimal Transport for Discrete Data 


9 


ii) For any Xj £ supp(P) and i = 1, ..., N, one has | {y*j k > 0| Xi k £ supp(Pi)} | = 1. 

Proof i) Suppose the statement in i) is false. Then there exists an Xj 0 £ supp(P) and there are points 
x. ik . G supp (Pi) for i = 1,..., N such that y*j ok . > 0 for each i and Xj 0 ^ ^ JT x i ki . 

Let a = min iV* jok . > 0 and let Xj . = jfJ2i x iki- Then define (y,z) such that y ijok . = y* jgk . - a 
for each i, t)ij* ki = Vij* ki + Q f° r each i, 2j 0 = z* Q — a, Zj * = z*, + a, and Zj = Zj and yij k = y*j k for 
all other variables. 


It is easily checked that (y,z) is also a feasible solution to (23). Further 


c(y) = c(y*) + a (^2c i:j , ki - < c(y*), 


where the strict inequality follows since Xj 0 ^ ^ JT x ik, = x j * and therefore 

^ ' Cijoki ^ ' \ x jo x iki | ^ ^ ' \ x j* x iki I ^ ^ * k-; i 


(26) 


(27) 


which is a contradiction with P being a barycenter. 

ii) If Xj £ supp(P), then z* > 0 and therefore | {y*j k > 0| xn- G supp(Pi)} | > 1 for all i is an 
immediate consequence of the contraints in (23). Suppose rhe statement is false, then there is some 


Xj £ supp(P) such that, without loss of generality, \{y\j k > 0| xi k £ supp(Pi)}| > 2. Then we can 
choose Xu-' i=- X\ k " such that y*j k i , y( jk u > 0 and further can choose x.- lk . for i = 2,... ,N such that 
Vijki > 0 for each i. Then this implies, by part ( i ), that 


jy { x lk' T ^ ^ ikj ) x j jy( x lk" + ^ ) x ikj ); (^8) 

i=2 i =2 

which in turn immediately would imply x\ k i = X\ k n\ a contradiction with our choice of x\ k : ^ x kk n. 
Hence [ {y\ jk > 0| x ik £ supp(Pi)} [ = 1. □ 

Lemma [I] already implies that there exists a transport from any barycenter P to each Pi. However, 
to prove Theorem [l] we need the concept of strict complimentary slackness. If you have a primal LP 
{min c T x\ Ax = b, x > 0} which is bounded and feasible and its dual LP {maxb T y| A r y < c}, then 
complimentary slackness states that the tuple (x*, y*) gives optimizers for both of these problems 
if and only if x*(ci — a, r y*) = 0 for all i , where aj is the i-th column of A. This statement can be 
strengthened in form of the strict complimentary slackness condition [29] : 

Proposition 2 Given a primal LP {minc T x| Ax = b, x > 0} and the corresponding dual LP 
{maxb T y| A 1 y < c}, both bounded and feasible, there exists a tuple of optimal solutions (x*,y*), to the 
primal and dual respectively, such that for all i 

x* (Ci - a; T y*) = 0, x* + (Cj - a; T y*) > 0. (29) 


With these tools, we are now ready to prove Theorem [l] 


Proof (Proof of Theorem [7J) Let (y* ,z* ,t* ,9*) be a solution to (23) and (251, as guaranteed by Propo¬ 
sition [ 2 ] Let P be a barycenter corresponding to the solution (y,z). For each Xj £ supp(P) let 
x ikj G supp(Pi) be the unique location such that yij kj > 0 as guaranteed by Lemma |T] part [7i|). Now 
for each i define 

1 , , 2 . 1 * 


Ipi{x) = max (x,x ik ) - -\x ik \ 

®ifcGsupp(Pi) Z 


+ Ifik- 


(30) 


Using Lemma [l] part [i]), it is easy to see that for proving part of Theorem [l] it suffices to show 

that for each 1 pi we have that X7ipi(xj) = Xi kj for each Xj £ supp(P). 

By definition, each ipi is convex (as the maximum over a set of linear functions) and i/Ji(x) is finite 
for all x £ R d . Further 


\ x \ 2 ~ 2ipi(x) = \x\ 2 - 2 max {x,x ik )-\\x ik \ 2 + \r* k 

XikSsupp(Pi) z z 

= min \x\ 2 - 2(x,x ik ) + \x ik \ 2 - r* k 

x ifc esupp(P;) 

= min \x - x ik \ 2 - r* k , (31) 

x ik es\ipp(Pi) 







10 


Ethan Anderes et al. 


and hence 

\xj\ 2 -2ipi(xj) = min \xj - x ik \ 2 - r* k = min c ijk - r* k . (32) 

XifcGsupp(Pi) £CifcGsupp(Pi) 

By complimentary slackness, we have that since yij kj i=- 0, that Cj j kj — r* k . — 9*j = 0. Therefore by 
strict complimentary slackness we get y*j kj ^ 0 and hence by Lemma llj part pill we get y*j k = 0 for all 
k kj. This implies by strict complimentary slackness that for all k ^=~kj we obtain c i3k — r* k — 9*j ^ 0 
and therefore, by feasibility, that c i3k — T* k < 9*j. Factoring in that Cjj k . —r* k . = 9*j by complimentary 
slackness we have that \xj\ 2 — 2ipi(xj) = 0*j. Further, since the function corresponding to kj is the only 
continuous function in the minimization that achieves this minimum at Xj (by the above argument), 
we obtain that for x in some neighborhood of xj 

\x\ 2 - 2 ipi(x) = \x- x ik .\ 2 - T* k . , 

=> tpi(x) = (x,X ikj ) - i \x ikj \ 2 + i T * k ., 

=> VV'i(x) = X ikj , (33) 

(34) 

so that Vipi(xj) = Xj kj . Further, note that complimentary slackness implies = 0 f° r Gach 

Xj £ supp(P) and hence 

0 = E e *i = E l^l 2 - 2 ^(*i) (35) 

i i 

=> = ~2~- ( 36 ^ 

i 

This shows part [hj] of Theorem [l] and thus completes the proof. □ 


3.3 Sparsity and Transportation Schemes 


As before, let Pi,, P.y be discrete probability measures, with point masses d ik for Xi k £ supp(Pi) 
defined as in the previous subsection. Then for any set S C S x supp(Pi) x ... x supp(P.v) we fix 
an arbitary order on S, i.e. S = {si, S 2 , • • •, s m } where each Sh = (qho, Qhi, ■ ■ ■, QhN), and define a 
location-fixed transportation scheme as the set 

m 

T{S) := {w £ RgJ| Y, ™ h = d ik ,Vi = l,...,N,Vk=l,...,Sj}. (37) 

h= 1, 
qhi=xik 

Informally, the coefficients of w £ T(S) correspond to an amount of transported mass from a given 
location in S to combinations of support points in the Pi, where each of these support points receives 
the correct total amount. Given a w, we define its corresponding discrete probability measure 

m 

P(w, 5) := Y w hSq h0 , (38) 

h= 1 

and the cost of this pair (w, S) 

m N 

c(w,5) := Y°h w h, c h = Y \lho - Qhi\ 2 - (39) 

h =1 2=1 


In the following, let supp(w) denote the set of strictly positive entries of w. Informally, we now give a 
translation between IV-star transports, the feasible region of (21), and location-fixed transportation 
schemes. 


Lemma 2 Given S C S x supp(Pi) x ... x supp(Pn) such that T(S) 0: 

i) For each w £ T(S), P(w,S) is a probability measure with \supp(P(w, S))| < |swpp(w)|. 

ii) For each w £ T(S), there exists an N-star transport y between P(w,S) and P\ ,... ,P N such that 
c(w,5) = c(y). 

Hi) For every discrete probability measure P supported on S and N-star transport y between P and 
Pi,.. ., Pm there exists a pair (w ,S r ) such that: w £ T(S'), P = P(w,S'), and = c(y). 
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Proof i) |supp(P(w,<S))| < |supp(w)| is clear by definition (note that strictness of this inequality 
can occur if there exist non-zero for which q k0 = q^'o)- To see that P(w,<S) is a probability 

measure it suffices to show that Ylh=i w h = 1- This holds since for any i = 1,... ,N we have 

m Si Si 

E ^ = E E w h = Y2 dik = 1; ( 4 °) 

h=l k= 1 h, k= 1 

qhi=x ik 

since the P t are probability measures. 

ii) For each i = 1,..., N, j = 1,..., |£|, k = 1,...., Si define 

m 

Vijk = E Wh ' ( 41 ) 

h= 1 
QhO = 3'j 
Qhi=Xik 

Clearly y.ij k > 0 and it is easily checked that yT y t jk = djfc for any * and k and that y t j k is the 
mass at location Xj £ S in the measure P(w,<S). 

Hence y is an IV-star transport between P(w,<S) and Pi ,P N . Further we have 


m 

c (y) = e \ x 3 - x ik \ 2 • Vijk = e i xj - x ik \ 2 ■ e w h 

i,j,k i,jyk h =1 

qho =x j 
qhi=Xik 

N m m 

= EE Who - ihi \ 2 • Wh = E ° hWh = c ( w ’ 5 )- 

2=1 h= 1 h=l 


N m 

EE E Wm-< ihi\ 2 ■ w h 

2=1 j,k h= 1 
qho = %j 
qhi = 3'ik 


(42) 


Hi) We note first that all of our arguments up to now not only hold for Pi and P being proba¬ 
bility measures, but for any measures with total mass 0 < r < 1 that is the same for all Pi and P. 
Using this fact we prove this part of the lemma for these types of measures by induction on |supp(y)|. 

For |supp(y)| = 0, we clearly have that any S paired with w = 0 satifies the given conditions. 
So suppose |supp(y)| > 0, then let y = nain yijk> o yij k and let ( i*,j*,k*) be a triplet such that 
yi*j*k* = a ‘ r S. rn ^ n vijk>o Vijk- This implies that dj * > y and so for each i = 1 there exists a 

hi such that Vij* ki — P- In particular one can choose hi * = k* here. We then have a vector y' with 
y'ij*ki = Vipki ~ P an d y'ijk = Vijk otherwise. Then y' is an IV-star transport for P' to P[,...,P r N 
where P' is obtained from P by decreasing the mass on Xj* by y and each P[ is obtained from P> by 
decreasing the mass on xi k . by y. Then |supp(y')| < |supp(y)| since y\*j<, k * = *-*• 

Therefore, by induction hypothesis, there exists a pair (w, S') such that w £ T(S'), P' = P(w,S'), 
and c(w, S') = c(y') for P[,...,P' N . Let now |S'j = m and let s m + 1 = (xj*,x lkl ,... ,x iki ,... ,x NkN ) 
and define S = S U {sm-t-i }. Then (w 7 ,y) £ T(S) and P = P((w T ,p),S) for Pi,... ,P N . Further we 
have that 


N N 

c(y) = c(y') + = c (y0 + E \ X P ~ x iki\ 2 P 

2=1 2=1 

= c(w T ,5') + c m +iP = c((w,p),5), (43) 

which completes the proof by induction. □ 

We now show the existence of a transportation scheme w* for which |supp(w*)| is provably small. 

Lemma 3 Given a location-fixed transportation scheme T(<S) 0 for discrete probability measures 

Pi,... ,P N , there exists w* £ argmin wg 7 -( 5 ) c(w,S), such that 

N 

| suppfw* ) | < E Si - N + 1. 

i —1 


(44) 



12 


Ethan Anderes et al. 


Proof We have that min we 7 -( 5 ) c(w,5) is equivalent to the following LP by definition: 

min c(w,S) 

w 

^ ) Wh d%k 5 1,..., IV, \/k 1, . . . , Sj, 

h. 

qhi=xik 

Wh>0, Vh=l,...,m. (45) 


Thus there is a basic solution to this problem w* G T(<S) such that |supp(w*)| is bounded above by 
the rank of the matrix of the equality constraints in the first line. 

Since there are X^'S'i = X); |supp(i = \) | of these equality constraints by definition, it suffices to 
show that at least IV — 1 of these constraints are redundant. Let a ifc denote the row corresponding 
to the equation for some i and 1 < k < Si- Note that for a fixed i, XX a,-*. yields a vector of all ones, 
as Wh appears in exactly one equation for each fixed i. Hence it is immediate that the row a.jg. is 
redundant for all i = 2,..., N since 

Si-1 Si Si-1 

a iSi = 1 ~ &ik = / . &lk — &ik ’ (46) 

fc=1 k =1 k=1 


where 1 is the row vector of all-ones. Hence we get N — 1 redundant rows. □ 
We are now ready to prove Theorem [2] 


Proof (of Theorem [X|) Since all barycenters are a solution to (23), there exists an IV-star transport y' 
from some barycenter P' to Pi ,...,P N and c(y') = J2tLi W 2 {P', Pi) 2 - By Lemma 1 part there 
is some location-fixed transportation scheme T(S) for Pi ,P N and some w' G T(S) such that 
P' = P(w',S) and c(y') = c(w',5). By Lemma [ 3 ] there is some w* G argmin w6 7 -(g) c(w,5) such that 
|supp(w*)| < X)!^=i Si — N + 1. Now let P = P(w*,S), then by Lemmaji] 


N 


|supp(P)| < |supp(w*)| <J2 S i~ N + 1. (47) 

2=1 

Further, by Lemma [ 2 ] part [X| l, there is an TV-star transport y between P and Pi ,P N such that 
N N N 

^fF 2 (P,Pi) 2 <c(y) =c(w*,5) <c(w',5) = C (y') = ^VF 2 (P',Pi) 2 <^W 2 (P,P i ) 2 , (48) 


where the last inequality follows since P' is already a barycenter. Hence this chain of inequalities 
collapses into a chain of equalities and we see that P is the desired barycenter. □ 


Finally, let us exhibit how to refine our results for discrete probability measures arising that are 
supported on an LiX ... xP^-grid in R d that is uniform in all directions. 

Proof (of Corollary [7]) 

An Pi x ... xP^-grid in R d for eo G R d and linearly independent vectors ej, ..., G R d is the set 
d 

{v G R d : v = eo + X) l -i es : 0 < Is < L s — 1, l s £ Z}. Since by Proposition 


we have supp(P) C S, 


for each Xj G supp(P) there exist Xi = e 0 + X) X*-i es with 0 < a s i < L s — 1 for all i < N such that 

S = 1 ” 

N N N d N d 

= V D« = V £■=0 + V £ £ T^T'—•»+£ £ w 


2=1 2=1 2=1 S = 1 2=1 S= 1 

This tells us that supp(P) lies on the (N(L\ — 1) + 1) x ... x (iV(P<j — 1) + l)-grid for eo and ei,. .., e<i. 

Since supp(P^) lies on an PiX ... xP^-grid, the absolute bound on |supp(P)| follows immediately 
from Theorem[2] Since P is supported on a (N(Li — 1) + 1) x ... x (N(Ld — 1) + l)-grid, we observe 
a relative density of less than 


jv(n Li-i)+i 

i =1 


< 


n n p* 

i= 1 


1 n L,: 


d d Nd-i 11 (£ - 1 )’ 

n (Nfa - 1) +1) N* n m -1) 

i =1 i= 1 


in this grid, which proves the claim. □ 


( 50 ) 
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4 Computations 


In this section we apply the computational and theoretical results developed in this paper to a 
hypothetical transportation problem for distributing a fixed set of goods, each month, to 9 Cali¬ 
fornia cities where the demand distribution changes month to month. A Wasserstein barycenter, 
in this case, represents an optimal distribution of inventory facilities which minimize squared dis¬ 
tance/transportation costs totaled over multiple months. Although this data is artificially generated 
for purposes of exposition, the data is based on observed average high temperatures per month 
(28| . All the source code used in this section is publicly available through the on-line repository 
https://github.com/EthanAnderes/WassersteinBarycenterCode 

The probability measures used in this example are defined on R 2 and are denoted Pd ec , Pj an , Pfeb, 
P m ar, Pj un , ^jub Pa ug and P ae p to correspond with 8 months of the year (scaling up to 12 months, while 
not intractable, imposes unnecessary computation burdens for computational reproducibility). The 
support of each distribution is given by the longitude-latitude coordinates of the following 9 California 
cities: Bakersfield, Eureka, Fresno, Los Angeles, Sacramento, San Bernardino, San Francisco, San Jose 
and South Lake Tahoe. The mass distribution assigned to each Pdeo ■ ■ ■, -Psep is computed in two steps. 
The first step calculates 

(population in city C ) x (average high temp for month M - 72°) 2 


for each city C and each month M. The second step simply normalizes these values within each 
month to obtain 8 probability distributions defined over the same 9 California cities. Figure [l] shows 
Pfeb : Pmar, P] 11 n and Pj u l- 

Let P denote an optimal Wasserstein barycenter as defined by Equation ([!]). Proposition [l] and 
Theorem [ 5 ] both give bounds on the support of P uniformly over rearrangement of the mass assigned 
to each support point in P dec , ■ ■ ■, Psep- Proposition [I] gives an upper bound for supp(P) in the form of 
a finite covering set which guarantees that finite dimensional linear programing can yield all possible 
optimal P (see (23)). Conversely, Theorem [ 2 ] gives an upper bound for the magnitude |supp(P)| which 
is additionally uniform over rearrangement of the locations of the support points in Pd ec , ■ ■ •, Psep- 

In the implementation presented here we use the modeling package JuMP ca which supports 
the open-source COIN-OR solver Clp for linear programming within the language Julia [3j. The set 
S, defined in |4]), covers the support of P and is shown in the rightmost image of Figure [5j A typical 
stars and bars combinatorial calculation yields IS) = ( 9 g®7 1 ) = (s 6 ) = 12870. The corresponding LP 
problem for P therefore has 939510 variables with 103032 linear constraints. On a 2.3 GHz Intel 
Core i7 MacBook Pro a solution was reached after 505 seconds (without using any pre-optimization 
step). The solution is shown in the leftmost image of Figure [ 2 ] Notice that Theorem [ 2 ] establishes 
an upper bound of 65 = 9-8 — 8 + 1 for |supp(P)|. The LP solution yields |supp(P)| = 63. Not 
only does this give good agreement with the sparsity bound from Theorem [2] but also illustrates that 
Wasserstein barycenters are very sparse with only 0.5% of the possible support points in S getting 
assigned non-zero mass. 

In Figure [3] we illustrate Theorem [I] which guarantees the existence of pairwise optimal trans¬ 
port maps from P to each P dec ,..., P sep which do not split mass. The existence of these discrete 
non-mass-splitting optimal transports is a special property of P. Indeed, unless special mass balance 
conditions hold, there will not exist any transport map (optimal or not) between two discrete prob¬ 
ability measures. The implication for this example is that all the inventory stored at a barycenter 
support point will be optimally shipped to exactly one city each month. Moreover, since the trans¬ 
portation displacements must satisfy Theorem Tj[i Ti ) each city is at the exact center of its 8 monthly 
transportation plans. 
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